Any information stored in R is an object. So, when you type the minimum age for drinking in Maharashtra (as shown below),
## [1] 25
the number, while it temporarily exists as an unnamed entity, is not stored in R until you define it as an object. You need to assign a name to a value (a number or a piece of text) if you want it to be stored in R. The assignment is done via the operator <- (\(\gets\)).
## [1] 25
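The two outputs above can be reproduced with a minimal sketch like the one below. The name min_age is the one these notes use later; the comments are my own.

```r
25              # typed on its own: printed, but not stored anywhere
min_age <- 25   # assigned with <-, so R now stores it as the object min_age
min_age         # typing the name prints the stored value
```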
In this course, we will be dealing with the following object types:
Integer: whole numbers and their negative counterparts. R treats a whole number as an integer if you end it with the letter L (for example, 25L).
Character: non-numeric variables, which can be created by sandwiching any given string between quotes.
Logical: this object type defines a true or false condition. For example, suppose you wanted to know if Anthony Gonsalves is \(\textit{akela}\) (alone) in \(\textit{duniya}\) (world); the answer is either TRUE or FALSE.
Factor: categorical information is stored as factors. At this point, I will proceed without giving you an example in \(\texttt{R}\), but it will help to know that variables like color, gender, race, caste, etc. are typically stored as factor objects.
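As a rough illustration of the first three object types, here is a small sketch. The object names legal_age, hero and is_akela, and the value TRUE, are hypothetical choices of mine, not taken from the original chunks. The function class() simply reports what kind of object R has stored.

```r
legal_age <- 25L                # integer: note the trailing L
hero <- "anthony gonsalves"     # character: text wrapped in quotes
is_akela <- TRUE                # logical: either TRUE or FALSE

class(legal_age)   # "integer"
class(hero)        # "character"
class(is_akela)    # "logical"
```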
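A few things to remember when working with objects:
\(\texttt{R}\) is case-sensitive.
Names cannot contain any special character except the underscore or the period¹.
\(\texttt{R}\) overwrites objects: assigning to an existing name replaces the old value without warning.
You can list the objects you have created by typing ls() in the console.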
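The points above can be checked with a short sketch. The object names city and home.city_2 are hypothetical and chosen only for illustration.

```r
city <- "Mumbai"
exists("City")          # R is case-sensitive: City is a different name from city
home.city_2 <- "Pune"   # periods and underscores are allowed inside a name
city <- "Pune"          # the old value "Mumbai" is overwritten without warning
city
ls()                    # lists the objects created so far
```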
Any work with data is incomplete without functions. Recall your Excel lessons, where you learnt different functions to perform operations. The same goes for \(\texttt{R}\). To execute any task you need to run a function. The setup of a function is very simple: you call the name of the function and sandwich the input (formally known as the argument) within parentheses.
\(\texttt{NAME OF THE FUNCTION(YOUR ARGUMENTS GO HERE)}\)
Let’s discuss some basic functions in \(\texttt{R}\).
print(): prints any given object (stored or otherwise).
## [1] 25
## [1] "anthony gonsalves"
mean(): computes the average of any numerical object.
## [1] 5.5
round(): rounds off a number or a set of numbers.
## [1] 2
factorial(): computes factorial for any given number.
## [1] 40320
sample(): draws a random sample from any given object.
## [1] 1 7
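The calls that produced these outputs are not shown in the notes, so the sketch below uses inputs that I have assumed because they are consistent with the printed results (for example, 1:10 for the mean and 8 for the factorial). The sample() draw is random, so its result will generally differ from the one shown above.

```r
min_age <- 25
print(min_age)               # prints a stored object
print("anthony gonsalves")   # prints an unstored string
mean(1:10)                   # average of the numbers 1 to 10: 5.5
round(2.3942)                # rounds to the nearest whole number: 2
factorial(8)                 # 8! = 40320
sample(1:10, size = 2)       # two random draws from 1 to 10; varies from run to run
```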
As discussed above, any given function has arguments. These arguments can be classified into required (without them, you will run into problems) and optional. Let’s consider an example. You have a number 2.3942, and you want it rounded off to two decimal places. You will need to specify the optional argument digits = in the round() function².
## [1] 2.39
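A minimal sketch of the call described above, with the optional argument spelled out:

```r
round(2.3942)               # digits defaults to 0, so this gives 2
round(2.3942, digits = 2)   # two decimal places: 2.39
```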
You can see the detailed layout of a function by using args(). Let’s see how this works.
## function (x, size, replace = FALSE, prob = NULL)
## NULL
Upon inspecting the output, you can see that the function sample() in \(\texttt{R}\) has four arguments: x, size, replace, and prob. The last two have default values (replace = FALSE and prob = NULL). To know more about each of these arguments, you can type help("sample") in the console.
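The output above comes from passing the function name to args(). The second line of the sketch, using formals(), is an addition of mine that returns just the argument names.

```r
args(sample)             # prints the layout: function (x, size, replace = FALSE, prob = NULL)
names(formals(sample))   # only the argument names: "x" "size" "replace" "prob"
# help("sample")         # opens the help page; run this in the console
```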
When you create an object in \(\texttt{R}\), it helps to know more about it. One way is to look at the \(\texttt{Global Environment}\) window of \(\texttt{RStudio}\), which contains the following information: \(\texttt{Name, Type, Length, Size, Value}\).
The other way to do the same (recommended) is to use a set of functions in \(\texttt{R}\).
class(): returns the class of an object.
## [1] "character"
## [1] "numeric"
str(): returns the class of an object, its size, and gives you a quick snapshot of different components of the object. This is especially useful when you are dealing with a large dataset.
## chr "anthony gonsalves"
## num 25
typeof(): returns the specific type of object under a given class.
## [1] "character"
## [1] "double"
So, you can see that while the class of the object min_age is numeric, the type is double.
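The outputs above can be reproduced with the sketch below. The object min_age is named in the text; the name hero for the character object is a hypothetical stand-in, since the original chunk is not shown.

```r
hero <- "anthony gonsalves"   # hypothetical name for the character object
min_age <- 25

class(hero)       # "character"
class(min_age)    # "numeric"
str(hero)         # chr "anthony gonsalves"
str(min_age)      # num 25
typeof(hero)      # "character"
typeof(min_age)   # "double"
```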
Many operations in \(\texttt{R}\) depend on packages. How do you know which package to install? Well, Google/Stack Overflow is the answer! Let’s try to install a package. To install a new package into \(\texttt{R}\), use the following syntax: install.packages("PACKAGE NAME HERE", OPTIONAL ARGUMENT). Once you install a package, you need to load it before you can use it. The syntax is library(PACKAGE NAME HERE). Please note that library() accepts the package name without quotes, whereas install.packages() requires them.
tl;dr version: you need two commands, install.packages() and library(), to use a package. We want to cut down the time spent on this. Therefore, I recommend that you install a package called pacman, which does both tasks, installing and loading, in one go. Once you have installed and loaded pacman, you are ready to use the function called p_load(), which lets you install and load as many packages as you want³! You should run install.packages("pacman", dependencies = TRUE) before you use p_load().
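A cautious sketch of the pacman workflow follows. The packages passed to p_load() here (stats and utils) are my own placeholders: they ship with R, so nothing is downloaded. Replace them with the packages you actually need.

```r
# Run once, if pacman is not yet installed:
# install.packages("pacman", dependencies = TRUE)

if (requireNamespace("pacman", quietly = TRUE)) {
  library(pacman)        # load pacman itself
  # p_load() installs any package that is missing and then loads it.
  p_load(stats, utils)   # placeholder packages; list your own here, without quotes
}
```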
Our datasets are stored in directories or folders of our machines. We would want \(\texttt{R}\) to know about the directory structure, and work from a particular directory known as the working directory. We should be able to set the working directory, call the working directory, list files in the folder, and change the working directory.
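The four tasks just listed can be tried out with the sketch below. It works inside R's temporary folder (tempdir()) so that it runs on any machine, and it creates a hypothetical empty file called sales.xlsx purely to have something to list. Both are my assumptions, not part of the original notes.

```r
old_wd <- getwd()                            # call (and remember) the current working directory
setwd(tempdir())                             # set the working directory to a temporary folder
file.create("sales.xlsx")                    # hypothetical empty file, only for illustration
list.files()                                 # list every file in the working directory
xlsx_files <- list.files(pattern = "xlsx")   # list only files whose names contain "xlsx"
xlsx_files
setwd("..")                                  # change to the parent folder
setwd(old_wd)                                # go back to where we started
```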
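setwd("FOLDER PATH HERE"): sets the working directory to the given folder.
getwd(): returns the current working directory.
setwd("~"): changes the working directory to your home folder.
setwd(".."): moves the working directory one level up, to the parent folder.
list.files() or dir(): lists the files in the working directory.
list.files(pattern = "xlsx"): lists only the files whose names contain "xlsx".
Vectors are one-dimensional objects that represent some information. Examples include age, GDP, profits, etc. You can create a vector by combining values with c(); the numeric vector nv below is one example: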
## [1] 10 15 20
The vector nv contains three objects, each indexed by the order in which it appears. Any object within a vector can be called by using the following syntax: VECTOR[INDEX], where INDEX is a whole number⁴.
## [1] 10
## [1] 15
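The vector nv and the two indexing calls above can be sketched as follows; the values are the ones shown in the output.

```r
nv <- c(10, 15, 20)   # c() combines the values into one numeric vector
nv                    # 10 15 20
nv[1]                 # first object: 10
nv[2]                 # second object: 15
```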
rep() and seq() functions
rep(): repeats a number or a set of numbers. You will need to specify two arguments: x (the number or the set) and times (the number of repetitions).
seq(): generates a sequence of numbers. You will need to specify three arguments: from = (starting point), to = (end point), and by = (common difference).
In the examples below, I have repeated the vector nv twice, and generated an arithmetic progression for numbers between 2 and 30 with a common difference of 4.
## [1] 10 15 20 10 15 20
## [1] 2 6 10 14 18 22 26 30
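A sketch of the two calls described above. The name ap for the sequence is the one the notes use later; nv_twice is a hypothetical name of mine.

```r
nv <- c(10, 15, 20)
nv_twice <- rep(x = nv, times = 2)     # 10 15 20 10 15 20
ap <- seq(from = 2, to = 30, by = 4)   # 2 6 10 14 18 22 26 30
nv_twice
ap
```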
When I introduced logical vectors in class, most of you were nonplussed. Logical operators are, however, extremely useful because all of conditional logic rests on them. For example, you may have a large dataset containing information on customers from all geographies, and want to analyse the data for one particular city. Before we learn how to subset and modify vectors, it is important to learn some of the operators. As we saw with the Anthony Gonsalves example, logical operators in \(\texttt{R}\) check whether a condition holds.
==: the ‘equal to’ operator. Let’s see how this works.
stu_age <- 23 # someone's age
stu_age == min_age # check whether her age equals the legal drinking age in Maharashtra
## [1] FALSE
!=: the ‘not equal to’ operator. An example:
stu_age <- 23 # someone's age
stu_age != min_age # check whether her age differs from the legal drinking age in Maharashtra
## [1] TRUE
&: the ‘and’ operator. Let’s say that you want to eat out in a restaurant which serves alcohol only to men (on top of the state regulation).
is_female <- TRUE # you happen to be a woman
stu_age <- 23 # your age
is_female == FALSE & stu_age >= 25 # check if you're allowed to drink
## [1] FALSE
|: the ‘or’ operator. In a university, a professor is eligible for promotion if she publishes at least 4 papers over a period of three years or has an average rating of at least 4 on teaching evaluations.
num_papers <- 3 # number of papers
teach_rating <- 4.2 # average rating
num_papers >= 4 | teach_rating >= 4 # check if eligible for promotion
## [1] TRUE
%in%: the ‘contained in’ operator. Example: check if the numbers 16 and 18 are contained in the sequence ap that we generated earlier.
## [1] FALSE TRUE
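The check above can be sketched as follows, assuming ap is the arithmetic progression from 2 to 30 generated earlier.

```r
ap <- seq(from = 2, to = 30, by = 4)   # 2 6 10 14 18 22 26 30
c(16, 18) %in% ap                      # 16 is not in ap, 18 is: FALSE TRUE
```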
Apart from logical operators, it is useful to know some logical functions.
any(): checks if any of the objects meet a condition.
## [1] TRUE
all(): checks if all of the objects meet a particular condition.
## [1] FALSE
which(): tells you which of the objects of a vector meet a given condition.
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [1] 5
The output tells you that the fifth object of the vector ap is contained in the vector c(12,16,18).
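A sketch of the three functions on the same vector ap. The condition ap > 20 used for any() and all() is my own assumption, since the original calls are not shown. The which() call follows the explanation above.

```r
ap <- seq(from = 2, to = 30, by = 4)   # 2 6 10 14 18 22 26 30

any(ap > 20)                   # TRUE: at least one object exceeds 20
all(ap > 20)                   # FALSE: not every object exceeds 20
ap %in% c(12, 16, 18)          # TRUE only in the fifth position (18)
which(ap %in% c(12, 16, 18))   # 5
```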
Now that we have learnt logical operators and functions, we are all set to subset a vector (and create a new vector). Consider the following example. To subset, we will use the following syntax: VECTOR[LOGICAL OPERATION].
## [1] 2 8
## [1] 14 20 26 32 38 44 50 56
## [1] 14 20 26 32 38 44
## [1] 14 20 26 32 38
## [1] 2 8 14 20 26 32 38 44 56
## [1] 2 8 56
## numeric(0)
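The subsetting calls themselves are missing from the notes. The outputs above are consistent with ap having been redefined as the sequence from 2 to 56 in steps of 6, and with conditions like the ones below. Both the redefinition and the specific conditions are my reconstruction, not the author's confirmed code.

```r
ap <- seq(from = 2, to = 56, by = 6)   # 2 8 14 20 26 32 38 44 50 56

ap[ap < 10]             # 2 8
ap[ap > 10]             # 14 20 26 32 38 44 50 56
ap[ap > 10 & ap < 50]   # 14 20 26 32 38 44
ap[ap > 10 & ap < 40]   # 14 20 26 32 38
ap[ap != 50]            # everything except 50
ap[ap < 10 | ap > 50]   # 2 8 56
ap[ap > 60]             # numeric(0): no object meets the condition
```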
Suppose that you are given a list of Beatles songs, and you notice that someone messed up the year for norwegian wood (the actual year is 1965).
beatles.songs <- c("please please me",
"magical mystery tour",
"norwegian wood")
year <- c(1963, 1967, 1963)
names(year) <- beatles.songs # the function names() assigns names to objects in a numeric vector
print(year)
## please please me magical mystery tour norwegian wood
## 1963 1967 1963
Let’s fix it.
## please please me magical mystery tour norwegian wood
## 1963 1967 1965
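One way to make the correction is to index the vector by name and assign the new value. This is a sketch, and the original chunk may have indexed by position instead (year[3] <- 1965).

```r
beatles.songs <- c("please please me",
                   "magical mystery tour",
                   "norwegian wood")
year <- c(1963, 1967, 1963)
names(year) <- beatles.songs

year["norwegian wood"] <- 1965   # replace only the wrong entry
print(year)
```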
You can summarize a vector using different functions. Consider, as an example, our old friend ap.
## [1] "numeric"
## [1] 10
## [1] 56
## [1] 2
## [1] 290
## [1] 29
## [1] 330
## 0% 25% 50% 75% 100%
## 2.0 15.5 29.0 42.5 56.0
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 2.0 7.4 12.8 18.2 23.6 29.0 34.4 39.8 45.2 50.6 56.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 15.5 29.0 29.0 42.5 56.0
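The calls behind these summaries are not shown. The sketch below lists functions whose results match the outputs above, in the same order, assuming ap is the sequence from 2 to 56 in steps of 6. The value 29 could come from either mean() or median(), and I have assumed mean().

```r
ap <- seq(from = 2, to = 56, by = 6)

class(ap)                              # "numeric"
length(ap)                             # 10
max(ap)                                # 56
min(ap)                                # 2
sum(ap)                                # 290
mean(ap)                               # 29
var(ap)                                # 330
quantile(ap)                           # quartiles
quantile(ap, probs = seq(0, 1, 0.1))   # deciles
summary(ap)                            # minimum, quartiles, mean, and maximum
```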
¹ The name of an object cannot start with a digit or an underscore (_). It may start with a period, provided the period is not followed by a digit.
² By default, round() uses digits = 0, as you saw in the previous example.
³ If you wish to see the list of installed packages, you can type installed.packages() in the console.
⁴ Indexing works for character vectors as well.